fix(test): resolve thread leak failures in CI #1533
Greptile Summary
This PR fixes two pre-existing thread-leak CI failures by (1) adding an explicit
Key changes and observations:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Test as Test Thread
participant Watchdog as Watchdog Thread (_watch_process)
participant Process as Native Subprocess
participant Module as NativeModule.stop()
participant Base as Module._close_module()
Test->>Process: mod.start() → subprocess spawned
Test->>Watchdog: watchdog thread started (daemon=True)
Process-->>Watchdog: process exits (die_after=0.2)
Watchdog->>Module: self.stop() [_stopping=False → crash path]
Module->>Module: _stopping = True
Module->>Module: self._watchdog is current_thread() → skip join
Module->>Module: self._watchdog = None, self._process = None
Module->>Base: super().stop() → _close_module()
Base->>Base: _module_closed_lock acquired, joins run_forever / _lcm_loop threads
Watchdog-->>Watchdog: _watch_process() returns → thread terminates
Test->>Test: poll: mod._process is None → break
Test->>Module: mod.stop() [idempotent]
Module->>Module: _stopping already True, _process already None, _watchdog already None
Module->>Base: super().stop() → _close_module()
Base->>Base: _module_closed already True → early return (no-op)
Note over Test,Base: monitor_threads fixture sees no leaked threads ✓
Last reviewed commit: 021be1f
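The shutdown path in the diagram can be approximated with a small stand-alone sketch. `WatchedModule` is a hypothetical stand-in, not the project's `NativeModule`; the `time.sleep` call stands in for waiting on the subprocess, and only the attribute names (`_stopping`, `_watchdog`, `_close_module`) are borrowed from the diagram. It illustrates why the watchdog must skip joining itself and why a second `stop()` is harmless.

```python
import threading
import time


class WatchedModule:
    """Hypothetical stand-in for a module guarded by a watchdog thread."""

    def __init__(self, die_after: float) -> None:
        self._die_after = die_after
        self._stopping = False
        self._closed = False
        self._closed_lock = threading.Lock()
        self._watchdog = None

    def start(self) -> None:
        self._watchdog = threading.Thread(target=self._watch_process, daemon=True)
        self._watchdog.start()

    def _watch_process(self) -> None:
        time.sleep(self._die_after)  # assumption: stands in for process.wait()
        if not self._stopping:
            self.stop()  # crash path: stop() runs on the watchdog thread

    def stop(self) -> None:
        self._stopping = True
        watchdog = self._watchdog
        # A thread cannot join itself (Thread.join raises RuntimeError),
        # so the join is skipped when stop() runs on the watchdog thread.
        if watchdog is not None and watchdog is not threading.current_thread():
            watchdog.join(timeout=2.0)
        self._watchdog = None
        self._close_module()

    def _close_module(self) -> None:
        with self._closed_lock:
            if self._closed:
                return  # second call is a no-op
            self._closed = True
```

Note that once the watchdog has cleared `self._watchdog`, a later `stop()` from the test thread has nothing left to join, so the watchdog thread may still be winding down at that moment.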
dimos/conftest.py
Outdated
# Filter out third-party daemon threads with generic names (e.g. "Thread-109").
# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
new_threads = [
    t for t in new_threads if not (t.daemon and re.fullmatch(r"Thread-\d+", t.name))
]

# Filter out threads we've already seen (from previous tests)
Filter silently suppresses leaks on Python < 3.12
The comment correctly explains the Python 3.12+ behaviour where threading.Thread(target=fn) embeds fn.__name__ in the thread name (e.g. "Thread-166 (run_forever)"). On Python 3.11 and earlier, auto-named threads are just "Thread-N" regardless of whether a target was passed, so every daemon thread started by project code without an explicit name= argument would also be silently skipped by this filter.
If the CI exclusively runs Python 3.12+ this is fine as written, but it's worth documenting that assumption or adding a Python-version guard so the filter doesn't inadvertently hide regressions on older interpreters:
import sys

_FILTER_GENERIC_DAEMON_THREADS = sys.version_info >= (3, 12)

# Filter out third-party daemon threads with generic names (e.g. "Thread-109").
# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
# NOTE: On Python < 3.12 all auto-named threads share this format, so the
# filter is intentionally disabled there to preserve leak detection.
if _FILTER_GENERIC_DAEMON_THREADS:
    new_threads = [
        t
        for t in new_threads
        if not (t.daemon and re.fullmatch(r"Thread-\d+", t.name))
    ]
I'm confused here though, because it only fails CI some of the time.
@paul-nechifor would know why this is flaky.
dimos/conftest.py
Outdated
# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
new_threads = [
This is sweeping the problem under the rug. If there's a specific thread leak we can't fix because it's from a third party library, we should ignore that issue alone, not all generic thread names, because we use generic thread names too.
dimos/core/test_native_module.py is a good fix (although a fixture would be better), but the other change is not. Do you know where those tests are failing in CI? I haven't seen it.
- Add mod.stop() to test_process_crash_triggers_stop so watchdog, LCM, and event-loop threads are properly joined from the test thread
- Filter third-party daemon threads with generic names (Thread-\d+) in conftest monitor_threads to ignore torch/HF background threads that have no cleanup API
Convert test_process_crash_triggers_stop to use a fixture that calls mod.stop() in teardown. The watchdog thread calls self.stop() but can't join itself, so an explicit stop() from the test thread is needed to properly clean up all threads. Drop the broad conftest regex filter for generic daemon thread names per review feedback.
mod.stop() is a no-op when the watchdog already called it, so capture thread IDs before the test and join new ones in teardown.
3f2eb91 to 3197ad3
Problem
test_process_crash_triggers_stop can flakily fail CI with "Non-closed threads created during this test" for threads run_forever, _lcm_loop, _watch_process.
The test waits for the watchdog to detect a process crash and call stop(), but the watchdog calls stop() from its own thread and can't join itself, so its thread (and sometimes the LCM/event-loop threads) may still be alive when the monitor_threads fixture checks.
Solution
Convert the test to use a pytest fixture that calls mod.stop() in teardown. This joins all threads from the test's main thread. stop() is idempotent, so it's safe even after the watchdog already called it.
Breaking Changes
None
How to Test
Contributor License Agreement